Introduction to R

Matthew Talluto

21-22 Feb 2024

Introduction to R

Setting up your workspace

Step 1

Download and install R

Step 2

Download and Install Rstudio

Parts of Rstudio

After launching Rstudio, create a new R script using the button in the upper left

Script: a text file where you will write an R-program. Commands in a script will be run in order, from the top to the bottom.

Parts of Rstudio

\(^1\)Mac users: usually you can substitute the command (⌘) key for control, and option for alt

Parts of Rstudio

Helpful vocabulary

Recommendations

Variables

Recommendations

# Comments in R start with the # symbol
# anything after # will not be executed
# Comments are a useful way to annotate your code so you know what is 
# happening

# Legal variable names
x = 1
y0 = 5
time_of_day = "20:15"
dayOfWeek <- "Monday"

# bad!
# d is the diversity in our site, in species
d = 8

# better!
site_diversity = 8


# error
0y = 5
todays date = "February 21"
## Error: <text>:21:2: unexpected symbol
## 20: # error
## 21: 0y
##      ^

Data types

Numeric operators

# assignment
x = 5
# equivalent
x <- 5

# class tells you the data type of a variable
class(x)
## [1] "numeric"

# math
x + 2
## [1] 7
(3 + x) * 2
## [1] 16
3^2
## [1] 9

Functions

Mathematical & functions

Get help on a function with ? or help(), for example: ?log or help(log). If you don´t know a function’s name, you can search for a (likely/suspected) string in its name with ??.

x = 5

# The print() function takes one or more arguments
#      in this case the variable x
# It returns no value, but has the side effect of printing the 
# value of x to the screen
print(x)
## [1] 5

# compute the natural logarithm of x
# then store it in y, then print the result
y = log(x)
print(y)
## [1] 1.609438

# functions can take multiple arguments, and arguments can be named
# here we change the base of the logarithm
log(x, base = 10)
## [1] 0.69897
# equivalents, found in the help file with ?log
#     log(x, 10)
#     log10(x)

Data structures: vectors

Creating vectors

# The c() function stands for concatenate
# it groups items together into a vector
(five_numbers = c(3, 2, 8.6, 4, 9.75))
## [1] 3.00 2.00 8.60 4.00 9.75

# Creating vectors
(one_to_ten = 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
class(one_to_ten)
## [1] "integer"
# integers are *also* numeric
is(one_to_ten, "numeric")
## [1] TRUE

## sequences and repeats
seq(1, 5, 0.4)
##  [1] 1.0 1.4 1.8 2.2 2.6 3.0 3.4 3.8 4.2 4.6 5.0
rep(0, 3)
## [1] 0 0 0

Vectorized operations


# math on vectors is performed on each element
five_numbers + 1
## [1]  4.00  3.00  9.60  5.00 10.75
five_numbers^2
## [1]  9.0000  4.0000 73.9600 16.0000 95.0625

sin(five_numbers)
## [1]  0.1411200  0.9092974  0.7343971 -0.7568025 -0.3195192

# vectors of other data types
class(five_numbers)
## [1] "numeric"
(five_letters = c("c", "b", "z", "q", "w"))
## [1] "c" "b" "z" "q" "w"
class(five_letters)
## [1] "character"
sort(five_letters)
## [1] "b" "c" "q" "w" "z"

as(five_numbers, "integer")
## [1] 3 2 8 4 9

Vectors: indexing

five_numbers
## [1] 3.00 2.00 8.60 4.00 9.75
five_numbers[1]
## [1] 3
five_numbers[3:5]
## [1] 8.60 4.00 9.75
five_numbers[c(1,4)]
## [1] 3 4

# get vector length
length(five_numbers)
## [1] 5
five_numbers[10]
## [1] NA

# can index with another variable
(i = length(five_numbers))
## [1] 5
five_numbers[i]
## [1] 9.75

# also an expression
five_numbers[i - 1]
## [1] 4

More basic functions

mean(five_numbers)
## [1] 5.47
median(five_numbers)
## [1] 4
sd(five_numbers)
## [1] 3.479152
var(five_numbers)
## [1] 12.1045
sum(five_numbers)
## [1] 27.35

# mathematical functions
exp(five_numbers)
## [1]    20.085537     7.389056  5431.659591    54.598150 17154.228809
log(five_numbers)
## [1] 1.0986123 0.6931472 2.1517622 1.3862944 2.2772673

More programming: logic and comparison

# less than, greater than, less/greater or equals
1 < 2
## [1] TRUE
class(1 < 2)
## [1] "logical"

2 <= 2
## [1] TRUE
2 > 4
## [1] FALSE
'q' >= 'm'
## [1] TRUE

# equality testing
2 == 1 + 1
## [1] TRUE
2 != 'q'
## [1] TRUE

# be careful with decimals!
0.2 + 0.1 == 0.3
## [1] FALSE
# preferred:
all.equal(0.1 + 0.2, 0.3)
## [1] TRUE

Logical operators

# and: both must be true
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE

# or: only one needs to be true
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE

# not: returns the opposite
!FALSE
## [1] TRUE

# both statements are TRUE, so this is TRUE
1 + 1 == 2 & 2 < 4
## [1] TRUE

Reading in real data

Download the Palmer penguins dataset

Save this in your project folder, inside another folder named data

Artwork by @allison_horst

Reading in real data

"/Users/ltalluto/Desktop/intro_r/data/penguins.csv"

Recommendation: use relative paths!

Hint: use .. to refer to one directory above the current (working) directory

penguins = read.csv("data/penguins.csv")
# read.csv also takes URLs!
# penguins = read.csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv")
class(penguins)
## [1] "data.frame"

Data prep recommendations

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

Working with data frames

# Load a dataset named 'penguins' and convert it to a data frame
# this dataset comes from a package, "palmerpenguins"
data(penguins, package = "palmerpenguins")
penguins = as.data.frame(penguins)
head(penguins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

Working with data frames

str(penguins)
## 'data.frame':    344 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
# data frame dimensions
nrow(penguins)
## [1] 344
ncol(penguins)
## [1] 8
dim(penguins)
## [1] 344   8

Working with data frames


# Error: there is no variable with that name!
bill_length_mm
## Error in eval(expr, envir, enclos): object 'bill_length_mm' not found

# accessing a variable by name
penguins$bill_length_mm
penguins[, 'bill_length_mm']
# equivalent: accessing a variable by position
penguins[,3] # get every row in the third column
##   [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
##  [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
##  [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
##  [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
##  [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
##  [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
##  [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9
## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1
## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1
## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8
## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5
## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8
## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3
## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5
## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2
## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3
## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2
## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8
## [271] 47.2   NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0
## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6
## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9
## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7
## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2

Working with data frames

# accessing a variable by name, and subsetting rows
penguins[1:10,"bill_length_mm"]
##  [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0

Subsetting data frames

# Get the first value in the third column
penguins[1,3]
## [1] 39.1

# Get the whole third column, then use head to print the first six values
head(penguins[,3])
## [1] 39.1 39.5 40.3   NA 36.7 39.3

# First five values of columns 3 and 4
penguins[1:5, c(3,4)]
##   bill_length_mm bill_depth_mm
## 1           39.1          18.7
## 2           39.5          17.4
## 3           40.3          18.0
## 4             NA            NA
## 5           36.7          19.3

# Can also index by name
penguins[1:5, c("bill_length_mm", "bill_depth_mm")]
##   bill_length_mm bill_depth_mm
## 1           39.1          18.7
## 2           39.5          17.4
## 3           40.3          18.0
## 4             NA            NA
## 5           36.7          19.3

# Remove the first 300 rows
penguins[-(1:300),]
##       species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 301 Chinstrap  Dream           46.7          17.9               195        3300
## 302 Chinstrap  Dream           52.0          19.0               197        4150
## 303 Chinstrap  Dream           50.5          18.4               200        3400
## 304 Chinstrap  Dream           49.5          19.0               200        3800
## 305 Chinstrap  Dream           46.4          17.8               191        3700
## 306 Chinstrap  Dream           52.8          20.0               205        4550
## 307 Chinstrap  Dream           40.9          16.6               187        3200
## 308 Chinstrap  Dream           54.2          20.8               201        4300
## 309 Chinstrap  Dream           42.5          16.7               187        3350
## 310 Chinstrap  Dream           51.0          18.8               203        4100
## 311 Chinstrap  Dream           49.7          18.6               195        3600
## 312 Chinstrap  Dream           47.5          16.8               199        3900
## 313 Chinstrap  Dream           47.6          18.3               195        3850
## 314 Chinstrap  Dream           52.0          20.7               210        4800
## 315 Chinstrap  Dream           46.9          16.6               192        2700
## 316 Chinstrap  Dream           53.5          19.9               205        4500
## 317 Chinstrap  Dream           49.0          19.5               210        3950
## 318 Chinstrap  Dream           46.2          17.5               187        3650
## 319 Chinstrap  Dream           50.9          19.1               196        3550
## 320 Chinstrap  Dream           45.5          17.0               196        3500
## 321 Chinstrap  Dream           50.9          17.9               196        3675
## 322 Chinstrap  Dream           50.8          18.5               201        4450
## 323 Chinstrap  Dream           50.1          17.9               190        3400
## 324 Chinstrap  Dream           49.0          19.6               212        4300
## 325 Chinstrap  Dream           51.5          18.7               187        3250
## 326 Chinstrap  Dream           49.8          17.3               198        3675
## 327 Chinstrap  Dream           48.1          16.4               199        3325
## 328 Chinstrap  Dream           51.4          19.0               201        3950
## 329 Chinstrap  Dream           45.7          17.3               193        3600
## 330 Chinstrap  Dream           50.7          19.7               203        4050
## 331 Chinstrap  Dream           42.5          17.3               187        3350
## 332 Chinstrap  Dream           52.2          18.8               197        3450
## 333 Chinstrap  Dream           45.2          16.6               191        3250
## 334 Chinstrap  Dream           49.3          19.9               203        4050
## 335 Chinstrap  Dream           50.2          18.8               202        3800
## 336 Chinstrap  Dream           45.6          19.4               194        3525
## 337 Chinstrap  Dream           51.9          19.5               206        3950
## 338 Chinstrap  Dream           46.8          16.5               189        3650
## 339 Chinstrap  Dream           45.7          17.0               195        3650
## 340 Chinstrap  Dream           55.8          19.8               207        4000
## 341 Chinstrap  Dream           43.5          18.1               202        3400
## 342 Chinstrap  Dream           49.6          18.2               193        3775
## 343 Chinstrap  Dream           50.8          19.0               210        4100
## 344 Chinstrap  Dream           50.2          18.7               198        3775
##        sex year
## 301 female 2007
## 302   male 2007
## 303 female 2008
## 304   male 2008
## 305 female 2008
## 306   male 2008
## 307 female 2008
## 308   male 2008
## 309 female 2008
## 310   male 2008
## 311   male 2008
## 312 female 2008
## 313 female 2008
## 314   male 2008
## 315 female 2008
## 316   male 2008
## 317   male 2008
## 318 female 2008
## 319   male 2008
## 320 female 2008
## 321 female 2009
## 322   male 2009
## 323 female 2009
## 324   male 2009
## 325   male 2009
## 326 female 2009
## 327 female 2009
## 328   male 2009
## 329 female 2009
## 330   male 2009
## 331 female 2009
## 332   male 2009
## 333 female 2009
## 334   male 2009
## 335   male 2009
## 336 female 2009
## 337   male 2009
## 338 female 2009
## 339 female 2009
## 340   male 2009
## 341 female 2009
## 342   male 2009
## 343   male 2009
## 344 female 2009

Logical subsets

# Return a data frame with only one species
penguins_adelie = subset(penguins, species == "Adelie")
head(penguins_adelie)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4   <NA> 2007
## 5 female 2007
## 6   male 2007

Logical subsets

# Try these, what do they do?
subset(penguins, species == "Adelie" & sex == "Male" & year != 2007)

subset(penguins, species %in% c("Chinstrap", "Gentoo"))

Logical subsets

# get the body mass of all female penguins with a larger-than-average bill length
penguins[penguins$sex == "female" & penguins$bill_length_mm > mean(penguins$bill_length_mm, na.rm = TRUE), "body_mass_g"]
##  [1]   NA 4500 4450 4550 4800 4650 4200 4800 5000 4400   NA 4600 5050 5150 4350
## [16] 4300 4200 5100 4850 4400 4900 4200 4400 4700   NA 4750 5200 4700 4600 4750
## [31] 4625 4725 4750 4875 4950 4750 4850 4875 4625 4850 4975   NA 5000 4375   NA
## [46] 4925   NA 4850 5200 3500 3525 3950 3250 4150 3800 3700 3575 3700 3450 3300
## [61] 3400 3700 3900 3850 2700 3650 3500 3675 3400 3675 3325 3600 3250 3525 3650
## [76] 3650 3775

Practise with data frames

Read in the penguin data using read.csv. Store it in a variable named penguins. Then try to do the following.

  1. How many rows and columns are in the dataset? Use dim, nrow, and ncol.
  2. What is the output of mean(penguins$bill_depth_mm)? Why? What if you try mean(penguins$bill_depth_mm, na.rm = TRUE)? Check the help file by typing ?mean. Does the same logic work for sd?
  3. How many penguins were captured in each year? Try the table function.
  4. You can use table with multiple columns of data. Try to find out how many of each species were captured in each year.
    • Hint: use c() inside [] to get multiple columns. See the previous slide.
  5. Add a new column called body_mass_kg, and compute the body mass, in kg instead of g.
    • Hint: you can use $ to create a new column by name
  6. Use subset to compute the mean bill length and bill depth for Gentoo penguins

Practise with data frames

## 1.
# dim gives rows and columns, in that order
nrow(penguins)
## [1] 344
ncol(penguins)
## [1] 8
dim(penguins)
## [1] 344   8

## 2.
# there are missing values, so this doesn't work
mean(penguins$bill_depth_mm)
## [1] NA
# adding na.rm = TRUE gives us what we expect
mean(penguins$bill_depth_mm, na.rm = TRUE)
## [1] 17.15117
sd(penguins$bill_depth_mm, na.rm = TRUE)
## [1] 1.974793

## 3.
table(penguins$year)
## 
## 2007 2008 2009 
##  110  114  120

## 4. 
table(penguins[, c('species', 'year')])
##            year
## species     2007 2008 2009
##   Adelie      50   50   52
##   Chinstrap   26   18   24
##   Gentoo      34   46   44

## 5.
# nothing is printed! but the new column is created with the data already stored
# can confirm by looking at the data frame with head
penguins$body_mass_kg = penguins$body_mass_g / 1000
head(penguins)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
##      sex year body_mass_kg
## 1   male 2007         3.75
## 2 female 2007         3.80
## 3 female 2007         3.25
## 4   <NA> 2007           NA
## 5 female 2007         3.45
## 6   male 2007         3.65


## 6.
penguins_gentoo = subset(penguins, species == "Gentoo")
mean(penguins_gentoo$bill_length_mm, na.rm = TRUE)
## [1] 47.50488
mean(penguins_gentoo$bill_depth_mm, na.rm = TRUE)
## [1] 14.98211

Cleaning up the penguin data

Our data frame has some missing data. Let’s remove the incomplete cases using the complete.cases function.

# Try this: what is the output?
complete.cases(penguins)
# what is this doing? why does this work?
sum(! complete.cases(penguins))
## [1] 11
# Again, break this down one piece at a time
# start inside the brackets and work out
penguins = penguins[complete.cases(penguins), ]

Getting summary statistics

Getting summary statistics

# one variable, one partition
tapply(penguins$bill_length_mm, penguins$species, mean, na.rm = TRUE)
##    Adelie Chinstrap    Gentoo 
##  38.82397  48.83382  47.56807

# one variable, two partitions
tapply(penguins$bill_length_mm, penguins[, c('species', 'sex')], 
       mean, na.rm = TRUE)
##            sex
## species       female     male
##   Adelie    37.25753 40.39041
##   Chinstrap 46.57353 51.09412
##   Gentoo    45.56379 49.47377

Getting summary statistics

# compute the sd of every variable
pvars = penguins[, c('bill_length_mm', 'bill_depth_mm', 
                     'flipper_length_mm', 'body_mass_g')]
sapply(pvars, sd)
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##          5.468668          1.969235         14.015765        805.215802

Summary statistics: practise

Summary statistics: practise

# Subset the data to look only at Adelie penguins.
penguins_adelie = subset(penguins, species == "Adelie")

# Does the distribution of flipper length of Adelie penguins vary
# among the different islands?
# we can use tapply with the different functions

## mean
tapply(penguins_adelie$flipper_length_mm, 
       penguins_adelie[, c('island', 'sex')], mean)
##            sex
## island        female     male
##   Biscoe    187.1818 190.4091
##   Dream     187.8519 191.9286
##   Torgersen 188.2917 194.9130

## sd
tapply(penguins_adelie$flipper_length_mm, 
       penguins_adelie[, c('island', 'sex')], sd)
##            sex
## island        female     male
##   Biscoe    6.744567 6.463517
##   Dream     5.510156 6.803749
##   Torgersen 4.638958 5.915412

## median
tapply(penguins_adelie$flipper_length_mm, 
       penguins_adelie[, c('island', 'sex')], median)
##            sex
## island      female  male
##   Biscoe       187 191.0
##   Dream        188 190.5
##   Torgersen    189 195.0

## min
tapply(penguins_adelie$flipper_length_mm, 
       penguins_adelie[, c('island', 'sex')], min)
##            sex
## island      female male
##   Biscoe       172  180
##   Dream        178  178
##   Torgersen    176  181

## max
tapply(penguins_adelie$flipper_length_mm, 
       penguins_adelie[, c('island', 'sex')], max)
##            sex
## island      female male
##   Biscoe       199  203
##   Dream        202  208
##   Torgersen    196  210

Summary statistics: practise

# a boxplot is a much nicer way to visualise this
boxplot(flipper_length_mm ~ island+sex, data = penguins_adelie)

Summary statistics: practise

# a boxplot is a much nicer way to visualise this
# the defaults are quite ugly, let's improve things
# I chose colours using colorbrewer
# https://colorbrewer2.org/#type=qualitative&scheme=Paired&n=7
# 3 colors, one for each island
cols = c("#a6cee3", "#fb9a99", "#fdbf6f")
boxplot(flipper_length_mm ~ island+sex, data = penguins_adelie, 
        at = c(1.6,2,2.4,    3.6,4,4.4), # at controls x-location, allows grouping by sex
        col = rep(cols, 2), # set the colours, repeated twice (female and male)
        names = c("", "Female", "", "", "Male", ""), # labels under the boxes
        xlab = "", ylab = "Flipper length (mm)", # set axis titles
        bty = 'n', #disable the box around the plot
        notch = TRUE,
        boxwex = 0.3 # thinner boxes
)
# add a legend to the plot
legend("topleft", legend = unique(penguins_adelie$island), 
       title = "Island", fill = cols, bty = 'n')

Graphics in R

R currently has two dominant graphical engines

Base graphics

ggplot2

Histograms: show the distribution of a single variable

hist(penguins$bill_depth_mm)

## do this once!
# install.packages("ggplot2")

## Once per session
library("ggplot2")

ggplot(penguins, aes(x = bill_depth_mm)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histograms: show the distribution of a single variable

Defaults are not amazing, some improvements needed

hist(penguins$bill_depth_mm, main = "", xlab="Bill depth (mm)", 
     ylab = "Frequency", breaks = 20, col="#999999", border=NA)

Histograms: show the distribution of a single variable

ggplot(penguins, aes(x = bill_depth_mm)) + 
    geom_histogram(bins = 20) + 
    theme_minimal() + xlab("Bill depth (mm)")

Colours

head(colors())
## [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3"

Basic statistics: comparing means

penguins_gentoo = subset(penguins, species == "Gentoo")
tapply(penguins_gentoo$body_mass_g, penguins_gentoo$sex, mean)
##   female     male 
## 4679.741 5484.836

Comparing means: more than 2 groups

Exercise: comparing means

Hypothesis 1: in Gentoo penguins, males are heavier than females

  1. Try to visualise the distribution of body masses in males and females using histograms. Does it look like there is a difference?
  2. Use t.test to test the hypothesis. Check the help file to get started.

Hypotheses 2 & 3

  1. Create a boxplot (or try a violin plot using geom_violin) visualising the differences in body mass among these 2 different factor variables.
  2. Use aov to perform an analysis of variance for each hypothesis. What do you conclude?

Hints

Comparing 2 histograms

Base graphics

Easiest just to do this in 2 panels. I set the axis limits so we have a fair comparison.

# create a multipanel figure, organised by rows, with 1 row and 2 colums
par(mfrow = c(1,2))
hist(subset(penguins_gentoo, sex == "female")$body_mass_g, 
     main = "Female Gentoo", col = 'pink', 
     border = NA, xlab = "Body Mass (g)", xlim = c(3800, 6500))
hist(subset(penguins_gentoo, sex == "male")$body_mass_g, 
     main = "Male Gentoo", col = 'lightblue', 
     border = NA, xlab = "Body Mass (g)", xlim = c(3800, 6500))

Comparing 2 histograms

GGplot

ggplot allows for fancier options

ggplot(penguins_gentoo, aes(x = body_mass_g, fill = sex)) + 
    geom_histogram(bins = 20, alpha = 0.8, position = "identity") + 
    theme_minimal() + xlab("Body Mass (g)") +
    scale_fill_manual("sex", values = c("pink", "lightblue"))

Running a t.test

t.test(body_mass_g ~ sex, data = penguins_gentoo, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  body_mass_g by sex
## t = -14.761, df = 116.64, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group female and group male is less than 0
## 95 percent confidence interval:
##       -Inf -714.6651
## sample estimates:
## mean in group female   mean in group male 
##             4679.741             5484.836

Boxplot: distribution of a y-variable across categorical x-variables

library(RColorBrewer)
(cols = brewer.pal(3, "Paired"))#[c(1, 5, 7)])
## [1] "#A6CEE3" "#1F78B4" "#B2DF8A"

boxplot(body_mass_g ~ species+sex, data = penguins, 
        # at controls x-location, allows grouping by sex
        at = c(1.6,2,2.4,    3.6,4,4.4), 
        # set the colours, repeated twice (female and male)
        col = rep(cols, 2), 
        # axis label text size
        cex.axis = 0.8,
        # labels under the boxes
        names = c("", "Female", "", "", "Male", ""), 
        # set axis titles
        xlab = "", ylab = "Body mass (g)", 
        # disable the box around the plot
        bty = 'n', 
        notch = TRUE,
        # thinner boxes
        boxwex = 0.3 
)
# add a legend to the plot
legend("topleft", legend = unique(penguins$species), 
       title = "Species", fill = cols, bty = 'n')

Boxplots: useful for comparisons among multiple groups

ggplot(penguins, aes(y = body_mass_g, x = sex, 
                            fill = species)) + 
    geom_boxplot(notch = TRUE) + 
    theme_minimal() + ylab("Body Mass (g)") + xlab("") +
    scale_fill_brewer(palette = "Paired") 

Violin plot: boxplot alternative that shows more information

# another option!
ggplot(penguins, aes(y = body_mass_g, x = sex, 
                            fill = species)) + 
    geom_violin() + 
    theme_minimal() + ylab("Body Mass (g)") + xlab("") +
    scale_fill_brewer(palette = "Paired") 

Correlation

# If you use plot on a data frame, you can view multiple relationships at once
plot(penguins[, 3:6])

Correlation

# I use round to make the printing nicer
round(cor(penguins[, 3:6]), digits = 3)
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm             1.000        -0.229             0.653       0.589
## bill_depth_mm             -0.229         1.000            -0.578      -0.472
## flipper_length_mm          0.653        -0.578             1.000       0.873
## body_mass_g                0.589        -0.472             0.873       1.000
# If you use plot on a data frame, you can view multiple relationships at once
plot(penguins[, 3:6])

Correlation

# I chose the weakest correlation
cor(penguins$bill_depth_mm, penguins$bill_length_mm)
## [1] -0.2286256
cor.test(penguins$bill_depth_mm, penguins$bill_length_mm )
## 
##  Pearson's product-moment correlation
## 
## data:  penguins$bill_depth_mm and penguins$bill_length_mm
## t = -4.2726, df = 331, p-value = 2.528e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3280409 -0.1242017
## sample estimates:
##        cor 
## -0.2286256
# magic spell to give each species a different color
cols = factor(penguins$species) 
plot(penguins$bill_depth_mm, penguins$bill_length_mm,
     pch = 16, #dots instead of circles
     col = cols, xlab = "Bill depth (mm)", 
     ylab = "Bill length(mm)", bty = 'n'
)
legend("topleft", legend = levels(cols), fill = 1:3, bty = "n")

Linear models

Linear models

Linear models

mod_billdepth = lm(bill_depth_mm ~ bill_length_mm * species + bill_length_mm * sex, data = penguins)
summary(mod_billdepth)
## 
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm * species + bill_length_mm * 
##     sex, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0740 -0.5208 -0.0642  0.4621  2.9887 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     16.25509    1.08618  14.965  < 2e-16 ***
## bill_length_mm                   0.03677    0.02854   1.289 0.198476    
## speciesChinstrap                -3.84651    1.93284  -1.990 0.047419 *  
## speciesGentoo                   -6.62661    1.75771  -3.770 0.000194 ***
## sexmale                          2.21069    0.91051   2.428 0.015726 *  
## bill_length_mm:speciesChinstrap  0.07512    0.04330   1.735 0.083688 .  
## bill_length_mm:speciesGentoo     0.06389    0.04068   1.571 0.117272    
## bill_length_mm:sexmale          -0.02183    0.02070  -1.054 0.292466    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8182 on 325 degrees of freedom
## Multiple R-squared:  0.831,  Adjusted R-squared:  0.8274 
## F-statistic: 228.3 on 7 and 325 DF,  p-value: < 2.2e-16
anova(mod_billdepth)
## Analysis of Variance Table
## 
## Response: bill_depth_mm
##                         Df Sum Sq Mean Sq  F value Pr(>F)    
## bill_length_mm           1  67.30   67.30 100.5327 <2e-16 ***
## species                  2 920.55  460.27 687.6071 <2e-16 ***
## sex                      1  79.72   79.72 119.0908 <2e-16 ***
## bill_length_mm:species   2   1.60    0.80   1.1982 0.3031    
## bill_length_mm:sex       1   0.74    0.74   1.1118 0.2925    
## Residuals              325 217.55    0.67                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing proportions

Comparing proportions

# save it in a variable so we can use the counts later
tab = table(penguins_gentoo$sex)
tab
## 
## female   male 
##     58     61
prop.test(tab)
## 
##  1-sample proportions test with continuity correction
## 
## data:  tab, null probability 0.5
## X-squared = 0.033613, df = 1, p-value = 0.8545
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.3953483 0.5802663
## sample estimates:
##        p 
## 0.487395

Tests of association: n × n Tables

Nesting holes of black-backed woodpeckers.

woodpecker = read.csv("https://raw.githubusercontent.com/flee-group/vu_datenanalyse/main/datasets/woodpecker.csv")
head(woodpecker)
##   forest_type nest_tree
## 1      burned     birch
## 2      burned     birch
## 3      burned jack pine
## 4      burned     aspen
## 5      burned     birch
## 6      burned jack pine

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

We want to test for an association between the two variables (forest type and nest tree)

\(H_0\): Nesting tree is not associated with forest type

\(H_A\): Nest tree is associated with forest type

Visualisation: Categorical Data

Barplots, or proportional bars for counts within categories

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
                fill = forest_type))
woodp_plot = woodp_plot + geom_bar(width = 0.5, 
                    position=position_dodge())
woodp_plot = woodp_plot + xlab("Nest Tree Type") +
    theme_minimal() + labs(fill = "Forest Type")
woodp_plot

Side-by-side bars allow us to compare all categories on equal footing.